This vignette outlines the steps of applying STEAM on the CODEX intestinal dataset.
Before using STEAM, we recommend preprocessing your spatial transcriptomics data using the Seurat framework, as it provides a robust set of tools for spatial data normalization, scaling, and visualization. The Satija Lab offers detailed vignettes to guide you through spatial data preprocessing:
To run STEAM, you need the following components:
library(STEAM)
## Loading required package: ggplot2
## Loading required package: RANN
## Loading required package: patchwork
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Package: STEAM
## Description: Evaluating consistency and reliability of clustering in spatial omics using STEAM
## Version: 1.2
## Release Date: 2025-02-01
## Authors: Samantha Reynoso, Courtney Schiebout, Revanth Krishna, Fan Zhang (University of Colorado School of Medicine - CU Medicine)
## Maintainer: Revanth Krishna <revanth.krishna@cuanschutz.edu>
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
CODEX <- readRDS("~/Desktop/STEAM/CODEX/old/CODEXData.RDS")
codex1 <- CODEX[[1]]
Parameters
Seurat.obj — a Seurat object with counts
and metadata.label.column — metadata column containing the
ground‑truthlabels (e.g., cortical layer, cell type).assay — which assay to read (e.g., "SCT",
"RNA").Outputs - STEAM.obj$count_exp,
STEAM.obj$labels, STEAM.obj$spatial (if
present in the Seurat object), and optional embeddings (PCA/UMAP) if
available.
You can input this data into STEAM in two ways:
# CODEX is your Seurat object
expr <- LayerData(codex1[["RNA"]], layer = "scale.data")
# 2) spatial coordinates from meta.data
coords <- as.data.frame(codex1@meta.data[, c("Xcorr","Ycorr")], stringsAsFactors = FALSE)
colnames(coords) <- c("x","y") # function expects x, y
coords$x <- as.numeric(coords$x) # just in case they’re characters
coords$y <- as.numeric(coords$y)
rownames(coords) <- Cells(codex1)
# 3) labels
labels <- codex1@meta.data$celltype
names(labels) <- Cells(codex1)
# (optional) reductions
pca <- Embeddings(codex1, "pca")
umap <- Embeddings(codex1, "umap")
STEAM.obj <- LoadSTEAM(
count_exp = expr, # genes x cells (or cells x genes; see package docs)
spatial = coords, # data.frame with rownames=cell_id and X/Y columns
labels = labels, # named vector: names = cell_id, values = label
pca = pca, # optional matrix/data.frame
umap = umap, # optional matrix/data.frame
clusters = NULL # optional alternative cluster labels
)
Parameters
count_exp — expression matrix.spatial — coordinates; must align rownames with cell
IDs used elsewhere.labels — named vector of reference labels (ground
truth).pca, umap, clusters —
optional; used for diagnostics/plots.Common checks - Row/column names must be consistent across objects.
labels must cover all cells you plan to evaluate.Key parameters
-mode — "nested" performs outer
and inner cross‑validation (outer for unbiased evaluation;
inner for tuning). -n_outer_folds,
n_inner_folds — fold counts. -cv.cores —
parallel workers for outer folds. -allowParallel — whether
caret may parallelize inner CV (often FALSE to avoid nested
parallelism contention). -model: The machine learning model
to use for classification.
- `rf` = RandomForest
- `svm` = Support Vector Machines,
- `xgb` = XGBoost
- `multinom` = Multinomial Logistic Regression
metric — caret optimization metric (e.g.,
"Accuracy", "Kappa").maxnweights, n.size — package‑specific
compute/feature knobs.Outputs
STEAM.obj$nested$ncv — nested CV object
containing:outer_result[[i]]$preds — outer test
predictions per fold with predy (prediction),
testy (truth), and per‑class probabilities.outer_result[[i]]$fit$pred — inner CV
predictions used for tuning.NOTE: custom model parameters and grids are at the bottom of this document.
STEAM.obj <- RunSTEAM(
STEAM.obj, mode = "nested",
n_outer_folds = 5, n_inner_folds = 3,
cv.cores = 10,
allowParallel = FALSE,
n.size = 5,
model = "rf",
metric = "Kappa",
maxnweights = 15000
)
## Creating nested CV folds using method: stratified
## === STRATIFIED CV DIAGNOSTICS ===
## Method used: stratified
## Outer folds: 5
## Fold sizes: 4795, 4791, 4789, 4785, 4783
## Fold size balance ratio: 1
## Overall class balance ratio: 1.05 (closer to 1.0 = better balance)
##
## === STRATIFICATION QUALITY METRICS ===
## Overall Quality Index: 0.981 (0-1 scale, higher = better, calibrated)
## Overall Chi-squared: 0.097 (lower = better, threshold=142.3)
## Overall CV: 0.003 (lower = better, threshold=0.3)
## Overall MAD: 0 (lower = better, threshold=0.073)
##
## Normalized penalty components:
## Chi-squared penalty: 0.001 (weight: 0.4)
## CV penalty: 0.01 (weight: 0.35)
## MAD penalty: 0.002 (weight: 0.25)
## Combined penalty: 0.004 (0=perfect, 1=poor)
##
## Per-class quality metrics:
## T_Cells: χ²=0.002, CV=0, MAD=0, p=1
## B_Cells: χ²=0, CV=0.001, MAD=0, p=1
## Mono_Macrophages: χ²=0.005, CV=0.001, MAD=0, p=1
## DC: χ²=0.007, CV=0.003, MAD=0, p=1
## Plasma: χ²=0.005, CV=0.002, MAD=0, p=1
## Smooth_muscle: χ²=0, CV=0.001, MAD=0, p=1
## Lymphatic: χ²=0.009, CV=0.003, MAD=0, p=1
## Endothelial: χ²=0.003, CV=0.001, MAD=0, p=1
## Stroma: χ²=0, CV=0.001, MAD=0, p=1
## Nerve: χ²=0.005, CV=0.002, MAD=0, p=1
## Enterocyte: χ²=0.002, CV=0.001, MAD=0, p=1
## TA: χ²=0.006, CV=0.002, MAD=0, p=1
## Goblet: χ²=0.002, CV=0.001, MAD=0, p=1
## Enteroendocrine: χ²=0.044, CV=0.024, MAD=0, p=1
## Cycling_TA: χ²=0.007, CV=0.004, MAD=0, p=1
##
## EXCELLENT stratification quality
##
## Class distributions per fold:
## Fold 1: B_Cells:649, Cycling_TA:110, DC:162, Endothelial:399, Enterocyte:327, Enteroendocrine:19, Goblet:485, Lymphatic:133, Mono_Macrophages:265, Nerve:152, Plasma:160, Smooth_muscle:452, Stroma:570, T_Cells:699, TA:213
## Fold 2: B_Cells:649, Cycling_TA:110, DC:162, Endothelial:399, Enterocyte:326, Enteroendocrine:18, Goblet:485, Lymphatic:133, Mono_Macrophages:265, Nerve:151, Plasma:159, Smooth_muscle:452, Stroma:570, T_Cells:699, TA:213
## Fold 3: B_Cells:649, Cycling_TA:110, DC:161, Endothelial:399, Enterocyte:326, Enteroendocrine:18, Goblet:485, Lymphatic:132, Mono_Macrophages:265, Nerve:151, Plasma:159, Smooth_muscle:452, Stroma:570, T_Cells:699, TA:213
## Fold 4: B_Cells:649, Cycling_TA:110, DC:161, Endothelial:398, Enterocyte:326, Enteroendocrine:18, Goblet:485, Lymphatic:132, Mono_Macrophages:264, Nerve:151, Plasma:159, Smooth_muscle:452, Stroma:570, T_Cells:698, TA:212
## Fold 5: B_Cells:649, Cycling_TA:109, DC:161, Endothelial:398, Enterocyte:326, Enteroendocrine:18, Goblet:484, Lymphatic:132, Mono_Macrophages:264, Nerve:151, Plasma:159, Smooth_muscle:452, Stroma:570, T_Cells:698, TA:212
## ================================
## Performing 5-fold outer CV, using 10 cores
## Duration: 3.893764 mins
## Finished NestedCV training (outer+inner with leakage-safe averaging)
## Collected NestedCV results
ViewMetrics() — tabular summaries overall / per
fold.plot_misclassified_cells() — visualize errors spatially
or by fold.feature_importance() — model‑specific importance (e.g.,
RF Gini/perm).feature_expression() — expression of a gene/feature in
different views.ViewMetrics(STEAM.obj, view = "overall")
# To view specific folds: ViewMetrics(STEAM.obj, fold = 1)
plot_misclassified_cells(STEAM.obj, fold = 1)
# plot all misclassified cells across all folds
# plot_misclassified_cells(STEAM.obj)
feature_importance(STEAM.obj)
## CD19 aSMA CD3 CD45 Cytokeratin CD31
## 1371.7548 1293.9182 1067.6079 1038.3247 909.0043 671.8581
## MUC2 CD163 CD11c SOX9
## 648.6989 637.8452 615.7991 527.9136
feature_expression(STEAM.obj, "CD19", view = "pooled")
#feature_expression(STEAM.obj, "CD19", view = "facet")
The STEAMCorrection system leverages spatial neighborhood information to identify and correct misclassified cells from your cross-validation results. This approach is based on the biological principle that spatially adjacent cells often share similar tissue types or functional states.
STEAMCorrection works by:
1.Identifying misclassified cells from nested CV results 2.Finding spatial neighbors using k-nearest neighbors 3.Applying hierarchical voting where neighbors vote on the correct cell type 4.Making corrections iteratively with adaptive consensus thresholds 5.Tracking progress across multiple correction rounds
The algorithm uses a hierarchical voting system:
This automatically processes all cross-validation folds with sensible defaults:
k = 8 neighborsmin_consensus = 0.6 (60% agreement for primary
corrections)max_iterations = 10For optimal results, you may need to adjust parameters based on your data characteristics:
Parameter Guidelines:
# Check correction summary across all folds
correction_results <- STEAM.obj$spatial_anchor_analysis$correction_results
for(fold_name in names(correction_results)) {
fold_data <- correction_results[[fold_name]]
cat(sprintf("Fold %d: %.1f%% → %.1f%% accuracy (%d corrections)\n",
fold_data$fold_number,
fold_data$initial_accuracy * 100,
fold_data$final_accuracy * 100,
fold_data$total_corrections))
}
# Show only key iterations
IterativePlot(STEAM.obj, fold = 1, iterations = c(0, 1, 2, 3))
# Show correction progress for a specific fold
#IterativePlot(STEAM.obj, fold = 1)
# Bar chart showing accuracy gains per fold
AccuracyPairedBarChart(STEAM.obj, show_corrections = FALSE)
# Analyze when spatial correction works best
NeighborhoodHomogeneityAnalysis(STEAM.obj)
# Which tissue types benefit most from correction?
ErrorAnalysisPlot(STEAM.obj)
High success indicators:
Correction challenges:
Common Issues:
No corrections applied:
getMisclassifiedCells(STEAM.obj)min_consensus thresholdmax_voting_levelsToo many corrections:
min_consensusk (fewer neighbors)Poor correction quality:
NeighborhoodHomogeneityAnalysis() resultsSlow performance:
max_iterationsk values1.Start with defaults then tune based on your tissue characteristics 2.Examine results visually using the plotting functions 3.Validate corrections make biological sense for your system 4.Consider tissue boundaries - corrections near boundaries may be less reliable 5.Document parameters used for reproducibility 6.Cross-validate correction performance on held-out data when possible
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Denver
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Seurat_5.3.0 SeuratObject_5.1.0 sp_2.2-0 STEAM_1.2
## [5] dplyr_1.1.4 patchwork_1.3.0 RANN_2.6.2 ggplot2_3.5.2
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.17.1
## [4] jsonlite_2.0.0 magrittr_2.0.3 spatstat.utils_3.1-3
## [7] farver_2.1.2 rmarkdown_2.29 vctrs_0.6.5
## [10] ROCR_1.0-11 matrixTests_0.2.3 spatstat.explore_3.4-2
## [13] htmltools_0.5.8.1 pROC_1.18.5 caret_7.0-1
## [16] sass_0.4.10 sctransform_0.4.2 parallelly_1.43.0
## [19] KernSmooth_2.23-26 bslib_0.9.0 htmlwidgets_1.6.4
## [22] ica_1.0-3 plyr_1.8.9 plotly_4.10.4
## [25] zoo_1.8-14 lubridate_1.9.4 cachem_1.1.0
## [28] igraph_2.1.4 mime_0.13 lifecycle_1.0.4
## [31] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-3
## [34] R6_2.6.1 fastmap_1.2.0 aricode_1.0.3
## [37] fitdistrplus_1.2-2 future_1.40.0 shiny_1.10.0
## [40] digest_0.6.37 colorspace_2.1-1 tensor_1.5
## [43] RSpectra_0.16-2 irlba_2.3.5.1 labeling_0.4.3
## [46] progressr_0.15.1 randomForest_4.7-1.2 spatstat.sparse_3.1-0
## [49] timechange_0.3.0 httr_1.4.7 polyclip_1.10-7
## [52] abind_1.4-8 compiler_4.4.1 proxy_0.4-27
## [55] doParallel_1.0.17 withr_3.0.2 viridis_0.6.5
## [58] fastDummies_1.7.5 MASS_7.3-65 lava_1.8.1
## [61] ModelMetrics_1.2.2.2 tools_4.4.1 lmtest_0.9-40
## [64] httpuv_1.6.16 future.apply_1.11.3 goftest_1.2-3
## [67] nnet_7.3-20 glue_1.8.0 nlme_3.1-168
## [70] promises_1.3.2 grid_4.4.1 Rtsne_0.17
## [73] cluster_2.1.8.1 reshape2_1.4.4 generics_0.1.3
## [76] recipes_1.3.0 gtable_0.3.6 spatstat.data_3.1-6
## [79] class_7.3-23 tidyr_1.3.1 data.table_1.17.0
## [82] spatstat.geom_3.3-6 RcppAnnoy_0.0.22 ggrepel_0.9.6
## [85] foreach_1.5.2 pillar_1.10.2 stringr_1.5.1
## [88] spam_2.11-1 RcppHNSW_0.6.0 later_1.4.2
## [91] splines_4.4.1 lattice_0.22-7 FNN_1.1.4.1
## [94] deldir_2.0-4 survival_3.8-3 tidyselect_1.2.1
## [97] miniUI_0.1.2 pbapply_1.7-2 knitr_1.50
## [100] gridExtra_2.3 scattermore_1.2 RhpcBLASctl_0.23-42
## [103] stats4_4.4.1 xfun_0.52 nestedcv_0.8.0
## [106] hardhat_1.4.1 timeDate_4041.110 matrixStats_1.5.0
## [109] stringi_1.8.7 lazyeval_0.2.2 yaml_2.3.10
## [112] evaluate_1.0.3 codetools_0.2-20 tibble_3.2.1
## [115] cli_3.6.5 RcppParallel_5.1.10 uwot_0.2.3
## [118] rpart_4.1.24 xtable_1.8-4 reticulate_1.42.0
## [121] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.0.14
## [124] zigg_0.0.2 spatstat.random_3.3-3 globals_0.17.0
## [127] png_0.1-8 Rfast_2.1.5.1 spatstat.univar_3.1-2
## [130] parallel_4.4.1 gower_1.0.2 mclust_6.1.1
## [133] dotCall64_1.2 glmnet_4.1-10 listenv_0.9.1
## [136] viridisLite_0.4.2 ipred_0.9-15 scales_1.4.0
## [139] prodlim_2025.04.28 e1071_1.7-16 ggridges_0.5.6
## [142] purrr_1.0.4 rlang_1.1.6 cowplot_1.1.3